{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "63tq0KVRnucw"
      },
      "source": [
        "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/learn/search/semantic-search/ner-search/ner-powered-search.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/learn/search/semantic-search/ner-search/ner-powered-search.ipynb)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "zMAEGl3xfu6u"
      },
      "source": [
        "# NER Powered Semantic Search"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "jRtbBZg0NUiy"
      },
      "source": [
        "This notebook shows how to use Named Entity Recognition (NER) for hybrid metadata + vector search with Pinecone. We will:\n",
        "\n",
        "1. Extract named entities from text.\n",
        "2. Store them in a Pinecone index as metadata (alongside respective text vectors).\n",
        "3. We extract named entities from incoming queries and use them to filter and search only through records containing these named entities.\n",
        "\n",
        "This is particularly helpful if you want to restrict the search score to records that contain information about the named entities that are also found within the query.\n",
        "\n",
        "Let's get started."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_iEQFegogJ7v"
      },
      "source": [
        "# Install Dependencies"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "kj2fUHdd_wzL",
        "outputId": "10d4ad4d-a1a9-45d3-8793-6ffa7814ed66"
      },
      "outputs": [
      ],
      "source": [
        "!pip install sentence_transformers pinecone-client==3.1.0 datasets -qU"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0YXIqN6DgPsc"
      },
      "source": [
        "# Load and Prepare Dataset"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "hI5m1Qb0QaSF"
      },
      "source": [
        "We use a dataset containing ~190K articles scraped from Medium. We select 50K articles from the dataset as indexing all the articles may take some time. This dataset can be loaded from the HuggingFace dataset hub as follows:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 177,
          "referenced_widgets": [
            "595a13d607884046acf2a81b0d329510",
            "c77f05e54728446a8fd0cd7c97bc8d24",
            "ee8fb21671774d549e80adc9836531b6",
            "260b4eafd25c4ae889b10c0c6d9cbb1a",
            "80b4c8d5ae7441e59f785dd98abe6755",
            "b2a5ccd15c0e424d8743b18154a4f33a",
            "017a6dc721914357a739ad27277752e1",
            "0e6bc2b9643945e5a2bed8486016ac63",
            "ba212b42c31c4970853b9e15ea42e413",
            "543dab16feec4ac39593ba9b8f946e26",
            "9ed62ad19d274bd38d38a27440953517",
            "0210760e95f04099990c786d9a0d39c8",
            "a98286dfd29a4dae93440e7f8dae0966",
            "757584ad98ec481c9fd5da905c25110f",
            "dfedb97223784997bc62a300b35485db",
            "0dc06dbb2765431c92901591f1c43fa1",
            "ea09a5537edd4d8ead281fb721090424",
            "29897c1ab3924d5888b9fa1dc32c4cc7",
            "f136c1b328494e40bc1443cd9a2f2bce",
            "00bfafd0ecc348d8bd2696dd90dd1620",
            "f5a9a883511547ab81e1ed94450b75ce",
            "7251a9ab66b1449698b9ac8820d794a2",
            "1debb1769d3944e2aeb3f5ad9b67f413",
            "d31f942d08de411ca200fdd648d15047",
            "6a7c703eb75e4f8a8958bd020ab4c431",
            "cd7316a4e4a7436f933d51b01632ab38",
            "505b65039c334d60a236a0d294806b72",
            "df7199cb469a4bac957658777a684d5f",
            "0c11a6cdb1c2479a85215b22a99cde10",
            "94c37c33142949179b3fc6525604bb19",
            "9cad585217e74172afdab991e05c3783",
            "601e5b52147e4aacbe52a3a2fd4174a5",
            "5ccbfef397e148bfbbd51b34445d27c6",
            "e66f580d79ac486d94405032dc946330",
            "c35674279cc34689824df152a83ea246",
            "2782f8409875418898a9460d769a15f6",
            "efb998d8c517475e9fb6871c2f2a8907",
            "f2a5ba6202654349ae05e2f540abab76",
            "5c71483cfe4b448fb8856f4a82c347d7",
            "52b9faf839c843f5a6c66ffaccc8b862",
            "8527737d18cd4963a5f39a0620fbe585",
            "798a25fd2f9e42c6ac030f546a8d8b9f",
            "1fdbec4059044816869935b3b9ad443c",
            "6d7e26a3ce764ce4bb2b341137a841e1",
            "73c95d356a394c01a4ac58af8d4b0e08",
            "5a5ea21e6ee046db8bb5ba1b28bd442b",
            "29acd0abd9784dc4bc0eb26409054f23",
            "64c25e5c3c89452e90049b7822df062d",
            "3c8b1993dd6940dfbecf9e9c3212b776",
            "19954b32467d494486ba0d8e7aab59c2",
            "6762f9595abb4fdc8baf555a275fafcc",
            "b341c0b9427d48d3820071b74e946a2e",
            "e5a997430510492b8c29a46c4dc51f13",
            "87977a8914404096b44df3a303a851fc",
            "f52234039aa5453aa282aa52f28d7dc7"
          ]
        },
        "id": "kj18hV5SgTQ6",
        "outputId": "f0675624-8ff7-46e3-ba18-afb7500f87c6"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "595a13d607884046acf2a81b0d329510",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading readme:   0%|          | 0.00/2.26k [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "0210760e95f04099990c786d9a0d39c8",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "1debb1769d3944e2aeb3f5ad9b67f413",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading data:   0%|          | 0.00/1.04G [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "e66f580d79ac486d94405032dc946330",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "73c95d356a394c01a4ac58af8d4b0e08",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Generating train split: 0 examples [00:00, ? examples/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        }
      ],
      "source": [
        "from datasets import load_dataset\n",
        "\n",
        "# load the dataset and convert to pandas dataframe\n",
        "df = load_dataset(\n",
        "    \"fabiochiu/medium-articles\",\n",
        "    data_files=\"medium_articles.csv\",\n",
        "    split=\"train\"\n",
        ").to_pandas()"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 3,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 337
        },
        "id": "K8JvjJFgBTiP",
        "outputId": "8327d5d3-3993-4a9c-8eb2-f832346e4d95"
      },
      "outputs": [
        {
          "data": {
            "text/html": [
              "\n",
              "\n",
              "  <div id=\"df-739a4c27-9026-435f-8046-c9273a16f834\">\n",
              "    <div class=\"colab-df-container\">\n",
              "      <div>\n",
              "<style scoped>\n",
              "    .dataframe tbody tr th:only-of-type {\n",
              "        vertical-align: middle;\n",
              "    }\n",
              "\n",
              "    .dataframe tbody tr th {\n",
              "        vertical-align: top;\n",
              "    }\n",
              "\n",
              "    .dataframe thead th {\n",
              "        text-align: right;\n",
              "    }\n",
              "</style>\n",
              "<table border=\"1\" class=\"dataframe\">\n",
              "  <thead>\n",
              "    <tr style=\"text-align: right;\">\n",
              "      <th></th>\n",
              "      <th>title</th>\n",
              "      <th>text</th>\n",
              "      <th>url</th>\n",
              "      <th>authors</th>\n",
              "      <th>timestamp</th>\n",
              "      <th>tags</th>\n",
              "    </tr>\n",
              "  </thead>\n",
              "  <tbody>\n",
              "    <tr>\n",
              "      <th>4172</th>\n",
              "      <td>How the Data Stole Christmas</td>\n",
              "      <td>by Anonymous\\n\\nThe door sprung open and our t...</td>\n",
              "      <td>https://medium.com/data-ops/how-the-data-stole...</td>\n",
              "      <td>[]</td>\n",
              "      <td>2019-12-24 13:22:33.143000+00:00</td>\n",
              "      <td>['Data Science', 'Big Data', 'Dataops', 'Analy...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>174868</th>\n",
              "      <td>Automating Light Switch using the ESP32 Board ...</td>\n",
              "      <td>A story about how I escaped the boring task th...</td>\n",
              "      <td>https://python.plainenglish.io/automating-ligh...</td>\n",
              "      <td>['Tomas Rasymas']</td>\n",
              "      <td>2021-09-14 07:20:52.342000+00:00</td>\n",
              "      <td>['Programming', 'Python', 'Software Developmen...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>100171</th>\n",
              "      <td>Keep Going Quotes Sayings for When Hope is Lost</td>\n",
              "      <td>It\u2019s a very thrilling thing to achieve a goal....</td>\n",
              "      <td>https://medium.com/@yourselfquotes/keep-going-...</td>\n",
              "      <td>['Yourself Quotes']</td>\n",
              "      <td>2021-01-05 12:13:04.018000+00:00</td>\n",
              "      <td>['Quotes']</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>141757</th>\n",
              "      <td>When Will the Smoke Clear From Bay Area Skies?</td>\n",
              "      <td>Bay Area cities are contending with some of th...</td>\n",
              "      <td>https://thebolditalic.com/when-will-the-smoke-...</td>\n",
              "      <td>['Matt Charnock']</td>\n",
              "      <td>2020-09-15 22:38:33.924000+00:00</td>\n",
              "      <td>['Bay Area', 'San Francisco', 'California', 'W...</td>\n",
              "    </tr>\n",
              "    <tr>\n",
              "      <th>183489</th>\n",
              "      <td>The ABC\u2019s of Sustainability\u2026 easy as 1, 2, 3</td>\n",
              "      <td>By Julia DiPrete\\n\\n(according to the Jackson ...</td>\n",
              "      <td>https://medium.com/sipwines/the-abcs-of-sustai...</td>\n",
              "      <td>['Sip Wines']</td>\n",
              "      <td>2021-03-02 23:39:49.948000+00:00</td>\n",
              "      <td>['Wine Tasting', 'Sustainability', 'Wine']</td>\n",
              "    </tr>\n",
              "  </tbody>\n",
              "</table>\n",
              "</div>\n",
              "      <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-739a4c27-9026-435f-8046-c9273a16f834')\"\n",
              "              title=\"Convert this dataframe to an interactive table.\"\n",
              "              style=\"display:none;\">\n",
              "\n",
              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "       width=\"24px\">\n",
              "    <path d=\"M0 0h24v24H0V0z\" fill=\"none\"/>\n",
              "    <path d=\"M18.56 5.44l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94zm-11 1L8.5 8.5l.94-2.06 2.06-.94-2.06-.94L8.5 2.5l-.94 2.06-2.06.94zm10 10l.94 2.06.94-2.06 2.06-.94-2.06-.94-.94-2.06-.94 2.06-2.06.94z\"/><path d=\"M17.41 7.96l-1.37-1.37c-.4-.4-.92-.59-1.43-.59-.52 0-1.04.2-1.43.59L10.3 9.45l-7.72 7.72c-.78.78-.78 2.05 0 2.83L4 21.41c.39.39.9.59 1.41.59.51 0 1.02-.2 1.41-.59l7.78-7.78 2.81-2.81c.8-.78.8-2.07 0-2.86zM5.41 20L4 18.59l7.72-7.72 1.47 1.35L5.41 20z\"/>\n",
              "  </svg>\n",
              "      </button>\n",
              "\n",
              "\n",
              "\n",
              "    <div id=\"df-3a68597a-f34a-4e46-a9f5-139104264379\">\n",
              "      <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-3a68597a-f34a-4e46-a9f5-139104264379')\"\n",
              "              title=\"Suggest charts.\"\n",
              "              style=\"display:none;\">\n",
              "\n",
              "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
              "     width=\"24px\">\n",
              "    <g>\n",
              "        <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
              "    </g>\n",
              "</svg>\n",
              "      </button>\n",
              "    </div>\n",
              "\n",
              "<style>\n",
              "  .colab-df-quickchart {\n",
              "    background-color: #E8F0FE;\n",
              "    border: none;\n",
              "    border-radius: 50%;\n",
              "    cursor: pointer;\n",
              "    display: none;\n",
              "    fill: #1967D2;\n",
              "    height: 32px;\n",
              "    padding: 0 0 0 0;\n",
              "    width: 32px;\n",
              "  }\n",
              "\n",
              "  .colab-df-quickchart:hover {\n",
              "    background-color: #E2EBFA;\n",
              "    box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "    fill: #174EA6;\n",
              "  }\n",
              "\n",
              "  [theme=dark] .colab-df-quickchart {\n",
              "    background-color: #3B4455;\n",
              "    fill: #D2E3FC;\n",
              "  }\n",
              "\n",
              "  [theme=dark] .colab-df-quickchart:hover {\n",
              "    background-color: #434B5C;\n",
              "    box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "    filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "    fill: #FFFFFF;\n",
              "  }\n",
              "</style>\n",
              "\n",
              "    <script>\n",
              "      async function quickchart(key) {\n",
              "        const containerElement = document.querySelector('#' + key);\n",
              "        const charts = await google.colab.kernel.invokeFunction(\n",
              "            'suggestCharts', [key], {});\n",
              "      }\n",
              "    </script>\n",
              "\n",
              "      <script>\n",
              "\n",
              "function displayQuickchartButton(domScope) {\n",
              "  let quickchartButtonEl =\n",
              "    domScope.querySelector('#df-3a68597a-f34a-4e46-a9f5-139104264379 button.colab-df-quickchart');\n",
              "  quickchartButtonEl.style.display =\n",
              "    google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "}\n",
              "\n",
              "        displayQuickchartButton(document);\n",
              "      </script>\n",
              "      <style>\n",
              "    .colab-df-container {\n",
              "      display:flex;\n",
              "      flex-wrap:wrap;\n",
              "      gap: 12px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert {\n",
              "      background-color: #E8F0FE;\n",
              "      border: none;\n",
              "      border-radius: 50%;\n",
              "      cursor: pointer;\n",
              "      display: none;\n",
              "      fill: #1967D2;\n",
              "      height: 32px;\n",
              "      padding: 0 0 0 0;\n",
              "      width: 32px;\n",
              "    }\n",
              "\n",
              "    .colab-df-convert:hover {\n",
              "      background-color: #E2EBFA;\n",
              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
              "      fill: #174EA6;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert {\n",
              "      background-color: #3B4455;\n",
              "      fill: #D2E3FC;\n",
              "    }\n",
              "\n",
              "    [theme=dark] .colab-df-convert:hover {\n",
              "      background-color: #434B5C;\n",
              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
              "      fill: #FFFFFF;\n",
              "    }\n",
              "  </style>\n",
              "\n",
              "      <script>\n",
              "        const buttonEl =\n",
              "          document.querySelector('#df-739a4c27-9026-435f-8046-c9273a16f834 button.colab-df-convert');\n",
              "        buttonEl.style.display =\n",
              "          google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
              "\n",
              "        async function convertToInteractive(key) {\n",
              "          const element = document.querySelector('#df-739a4c27-9026-435f-8046-c9273a16f834');\n",
              "          const dataTable =\n",
              "            await google.colab.kernel.invokeFunction('convertToInteractive',\n",
              "                                                     [key], {});\n",
              "          if (!dataTable) return;\n",
              "\n",
              "          const docLinkHtml = 'Like what you see? Visit the ' +\n",
              "            '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
              "            + ' to learn more about interactive tables.';\n",
              "          element.innerHTML = '';\n",
              "          dataTable['output_type'] = 'display_data';\n",
              "          await google.colab.output.renderOutput(dataTable, element);\n",
              "          const docLink = document.createElement('div');\n",
              "          docLink.innerHTML = docLinkHtml;\n",
              "          element.appendChild(docLink);\n",
              "        }\n",
              "      </script>\n",
              "    </div>\n",
              "  </div>\n"
            ],
            "text/plain": [
              "                                                    title  \\\n",
              "4172                         How the Data Stole Christmas   \n",
              "174868  Automating Light Switch using the ESP32 Board ...   \n",
              "100171    Keep Going Quotes Sayings for When Hope is Lost   \n",
              "141757     When Will the Smoke Clear From Bay Area Skies?   \n",
              "183489       The ABC\u2019s of Sustainability\u2026 easy as 1, 2, 3   \n",
              "\n",
              "                                                     text  \\\n",
              "4172    by Anonymous\\n\\nThe door sprung open and our t...   \n",
              "174868  A story about how I escaped the boring task th...   \n",
              "100171  It\u2019s a very thrilling thing to achieve a goal....   \n",
              "141757  Bay Area cities are contending with some of th...   \n",
              "183489  By Julia DiPrete\\n\\n(according to the Jackson ...   \n",
              "\n",
              "                                                      url  \\\n",
              "4172    https://medium.com/data-ops/how-the-data-stole...   \n",
              "174868  https://python.plainenglish.io/automating-ligh...   \n",
              "100171  https://medium.com/@yourselfquotes/keep-going-...   \n",
              "141757  https://thebolditalic.com/when-will-the-smoke-...   \n",
              "183489  https://medium.com/sipwines/the-abcs-of-sustai...   \n",
              "\n",
              "                    authors                         timestamp  \\\n",
              "4172                     []  2019-12-24 13:22:33.143000+00:00   \n",
              "174868    ['Tomas Rasymas']  2021-09-14 07:20:52.342000+00:00   \n",
              "100171  ['Yourself Quotes']  2021-01-05 12:13:04.018000+00:00   \n",
              "141757    ['Matt Charnock']  2020-09-15 22:38:33.924000+00:00   \n",
              "183489        ['Sip Wines']  2021-03-02 23:39:49.948000+00:00   \n",
              "\n",
              "                                                     tags  \n",
              "4172    ['Data Science', 'Big Data', 'Dataops', 'Analy...  \n",
              "174868  ['Programming', 'Python', 'Software Developmen...  \n",
              "100171                                         ['Quotes']  \n",
              "141757  ['Bay Area', 'San Francisco', 'California', 'W...  \n",
              "183489         ['Wine Tasting', 'Sustainability', 'Wine']  "
            ]
          },
          "execution_count": 3,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "# drop empty rows and select 20k articles\n",
        "df = df.dropna().sample(20000, random_state=32)\n",
        "df.head()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "NY8gj97qm3WU"
      },
      "source": [
        "We will use the article title and its text for generating embeddings. For that, we join the article title and the first 1000 characters from the article text."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 4,
      "metadata": {
        "id": "Sb2bEL7YOMjr"
      },
      "outputs": [],
      "source": [
        "# select first 1000 characters\n",
        "df[\"text\"] = df[\"text\"].str[:1000]\n",
        "# join article title and the text\n",
        "df[\"title_text\"] = df[\"title\"] + \". \" + df[\"text\"]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ciqKCl9gbN7s"
      },
      "source": [
        "# Initialize NER Model"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "9pZkI5KIRYkE"
      },
      "source": [
        "To extract named entities, we will use a NER model finetuned on a BERT-base model. The model can be loaded from the HuggingFace model hub as follows:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 5,
      "metadata": {
        "id": "kyu5qdDMooua"
      },
      "outputs": [],
      "source": [
        "import torch\n",
        "\n",
        "# set device to GPU if available\n",
        "device = torch.cuda.current_device() if torch.cuda.is_available() else None"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 6,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 281,
          "referenced_widgets": [
            "574bebb9c591418da4274cd6dd2adcb2",
            "fbfeee08972f40cda6a16793554ae8d6",
            "cc24c0737e3d4c90b0273566bb3b1795",
            "951c2f577f4e4f1c9b5e570f6eda34db",
            "45d5fa1cd27d4348a581b0ca193e2f36",
            "c55758a905e3454dbe231beae355b6e9",
            "1cb8abc5abc246a3aabfbdb2a5574c07",
            "b0aaafae83de455699caa7276ddd429e",
            "84ebb0599eb94819a47e91caebc4ca38",
            "3d303b0386894f179cbb353f6f35268e",
            "7338ef4fbea9494198e0b57475cfe58f",
            "57d3f1e6eea04c8cb9cd271917504b3d",
            "106e8db09a2a448baf06e090ccae8d9b",
            "8c0ba62486b4438abdc72c9e7aaa1f22",
            "27dc348a6644416cb8868bfd667fa256",
            "7d13274d9479473ebf75d8a78c267151",
            "bae741fc67ae4322b5870b8c3a3c3fa4",
            "a5a37dac254a42ee9861064df64cac17",
            "10a01ab2fec24ccc8fdd6c81484580aa",
            "08858dcc6727435c8e8d8a744e4257c3",
            "fe13250615aa41f88ce445507113d751",
            "aa65005be4a84c53af6cf057e6d2c8e5",
            "fce476c57e1244a385e4e343410af236",
            "4b40ed2c6902493e991b8f0bd16a836e",
            "2c229adc7e3b450d9d3b329920041f01",
            "9f9dda84183c4b3397a07980a7e65233",
            "1b75d1bdf23a4e07ab81d3c35dda4962",
            "e7c68a313682417c8c8178a5cc9f3049",
            "344ed337767942c89aa43546ff704f8b",
            "5cef876e18544cdbbb3bc321f5d7187f",
            "8d2d7bcde12e42d68bcc89a48e1979a2",
            "5ea72ae8b2244aa0a2f3f9d22fabd233",
            "42048b691ae04ad28d99f65a6ea0caa2",
            "376d5dd013324333a7571b27927a8baf",
            "ca2e68d4d0844d2c8b089a6cb17fb5eb",
            "01d8aaee42c245e7b551f1fdf3789f35",
            "0caf0db382e94df7a58d7e642d834591",
            "eb72bce763b24f9f9c454661968a75ab",
            "525a51e903ff43ea913adb46cf346d77",
            "eb9b9e47ab8842d4aa2ca84903ee59dd",
            "b9e151d13a2c4bf1b91eef94dabc8ad0",
            "2791bdaf4a3141bdabc7676970195103",
            "c7d5f4ac71c54fb98e6e19c3ee30ac45",
            "3adbfd8d3a4e4bdda4a38942f17df89a",
            "b9d79e6546db4094bfddb19d97b1e1b8",
            "25effbfad2684166b02c8f0d61860976",
            "b53e2930b2e944f4835daa6930d5e7bf",
            "a702a11a86df444dba4e7698bbdf619e",
            "1377c1884a46444790d1ff31cf8641e7",
            "00405c1b9ef8443dafce9a1b5780ecd7",
            "4b08df5bb8194ab2942df30df70d019c",
            "3e0eb347e0d94790be10cf845b99d86a",
            "a041b410b0e149b388fdaaf3ee3b7ced",
            "efe1ff06a8ad4980bc9c6238190ca54f",
            "c1cf30fe197a4cfe9cf7ca2bfa71fe15",
            "3d23564c87ac4c0495a996817f561c06",
            "4efc75ab99ae408f94d19caf0a0e7ec6",
            "4764b0436cc6486695b272b7ba012664",
            "c4ab062cead449a09fecd49ce4a40257",
            "029fa360bdfc48a090ba2bafd0bcd2c3",
            "b2f5f46384f540f6ba712ac32d69c334",
            "221ebdbcaa6f425a8822470047f7e2e4",
            "f24a24944b724066b703feba2ffc94bb",
            "fd9627f91efc430698a153551f57d605",
            "7c386cb2f56f4c30a508b5d60917c2db",
            "86f7f14b18c94d218b1e1e5736e49433"
          ]
        },
        "id": "PY7wu5f3_4GJ",
        "outputId": "d430fb94-efd0-4b6d-ca9b-0ca5cb95c2c6"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "574bebb9c591418da4274cd6dd2adcb2",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading (\u2026)okenizer_config.json:   0%|          | 0.00/59.0 [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "57d3f1e6eea04c8cb9cd271917504b3d",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading (\u2026)lve/main/config.json:   0%|          | 0.00/829 [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "fce476c57e1244a385e4e343410af236",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading (\u2026)solve/main/vocab.txt:   0%|          | 0.00/213k [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "376d5dd013324333a7571b27927a8baf",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading (\u2026)in/added_tokens.json:   0%|          | 0.00/2.00 [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "b9d79e6546db4094bfddb19d97b1e1b8",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading (\u2026)cial_tokens_map.json:   0%|          | 0.00/112 [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "3d23564c87ac4c0495a996817f561c06",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading model.safetensors:   0%|          | 0.00/433M [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']\n",
            "- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).\n",
            "- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).\n"
          ]
        }
      ],
      "source": [
        "from transformers import AutoTokenizer, AutoModelForTokenClassification\n",
        "from transformers import pipeline\n",
        "\n",
        "model_id = \"dslim/bert-base-NER\"\n",
        "\n",
        "# load the tokenizer from huggingface\n",
        "tokenizer = AutoTokenizer.from_pretrained(\n",
        "    model_id\n",
        ")\n",
        "# load the NER model from huggingface\n",
        "model = AutoModelForTokenClassification.from_pretrained(\n",
        "    model_id\n",
        ")\n",
        "# load the tokenizer and model into a NER pipeline\n",
        "nlp = pipeline(\n",
        "    \"ner\",\n",
        "    model=model,\n",
        "    tokenizer=tokenizer,\n",
        "    aggregation_strategy=\"max\",\n",
        "    device=device\n",
        ")"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 7,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "1OIXMIx2_7AA",
        "outputId": "d224ada7-5db3-4e18-eb2d-66b02a5aae38"
      },
      "outputs": [
        {
          "data": {
            "text/plain": [
              "[{'entity_group': 'LOC',\n",
              "  'score': 0.9996493,\n",
              "  'word': 'London',\n",
              "  'start': 0,\n",
              "  'end': 6},\n",
              " {'entity_group': 'LOC',\n",
              "  'score': 0.9997588,\n",
              "  'word': 'England',\n",
              "  'start': 25,\n",
              "  'end': 32},\n",
              " {'entity_group': 'LOC',\n",
              "  'score': 0.9993923,\n",
              "  'word': 'United Kingdom',\n",
              "  'start': 41,\n",
              "  'end': 55}]"
            ]
          },
          "execution_count": 7,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "text = \"London is the capital of England and the United Kingdom\"\n",
        "# use the NER pipeline to extract named entities from the text\n",
        "nlp(text)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "cb5Ln9SHS3cu"
      },
      "source": [
        "Our NER pipeline is working as expected and accurately extracting entities from the text."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "vLK8AklElh_H"
      },
      "source": [
        "# Initialize Retriever"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "OaruW2fdTJFD"
      },
      "source": [
        "A retriever model is used to embed passages (article title + first 1000 characters) and queries. It creates embeddings such that queries and passages with similar meanings are close in the vector space. We will use a sentence-transformer model as our retriever. The model can be loaded as follows:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 8,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 552,
          "referenced_widgets": [
            "2482617b01824ba883840d0a5733244d",
            "1a39edf721e24980b287e46bc9ea83ab",
            "a1f7619a46c44f009176118b1b219523",
            "97d659c7a3284c86b95919165731dbc9",
            "d13d5d629f6e4cfda1218b870495209f",
            "56524a85b4c949b0b8386a1209be9199",
            "147181fe769b4d69a55d3e5137f4be07",
            "02989f2223c34c6296556877d67f67f8",
            "cfcf61517eaa4581a7bd7399c7129d6a",
            "b2e4abc7083447d491a59aa30c31b850",
            "b16f6a9efdc14ee99a57b63999c6edbb",
            "d971f9fdff02467980219cef9e1ff242",
            "8d86f97e14924df5b7caeb82111c025c",
            "0519e24481e845d592b9a809f07cad47",
            "3ca32f2bedc643edb9a6d81f9a87828f",
            "ddf04c2c150748388416678c813afc5a",
            "a00d748ebeaf4a8f8f1909349ad64a36",
            "408217b45d384aa888c939b454b07a64",
            "e4783aeb9a5f4299abc6f354d6ff9e9d",
            "765091d5ad1e42bd986c56a0035e254b",
            "cd1d3da7dd0a4a22bc27aed3f05fbecf",
            "b7a1330a115a4f55bf53d5f5b68e9d1c",
            "e228fc0b479742ac9ae648e51644dbf5",
            "aded426db59e44929c297460cc88164f",
            "fb5c1d67572f4088b112055458742b1d",
            "9eab5c3db2174e63a25b3e2aabe2e45d",
            "77e1807f22d24b0b85084ed556ec588a",
            "dbf4db864d594479a5c3f0d9e543913c",
            "fa7005cf8d5b4a0f9d593e8082678cac",
            "ab9f48c863be44598e26d45e84adc06b",
            "830bb23892f6423194aeb8afd47457e8",
            "c0bbfe587e4d461d82b7d142598fb492",
            "227e7acf86374d69bd0e80df4599b123",
            "821be6088ae0424d9607838a22c2b0e8",
            "c2e0e64bf0cf4d08a38684539d8d5030",
            "0812e26da84542b0af6f18d8fc6eba2a",
            "ce6585a9ff3e4ccc9d23dca7ba8af659",
            "307cb07311524116a19d0a9703763aec",
            "10e80d15cc624da8ac705bb56edf3c6f",
            "ddd70166d760497e89aa54fd74b9be4c",
            "94a5c9a2c89a432f96d464d203b063c7",
            "0fdbc8baccce42eca200344253a948ac",
            "6a061da601394cecbca1413d3401b14e",
            "1801647518144c6b9f0900e793561385",
            "e161dafcbf1a4396bdf09cad59fa465f",
            "308c2c85375e44028f4f35a9eaaf26ba",
            "71804c51f5064cec910af4e53d161a7d",
            "5d8d1a11437e48e9ade5b4b4e2191739",
            "2f9575fbd7804bdc91a71cb8d1ada6c0",
            "72956446bbcb445b97e08a42481ac81e",
            "9f041dbc454f4308a6713f898c981c59",
            "3bba1b367e1846278fc9a232e9a72317",
            "33550c30488043b4b7e1306b98f32f99",
            "788556c86536459b8e35910045dbe8b6",
            "6c67f7fb418243f08bfb8b5fd3d47d35",
            "352f2df4ade34ea29595b042fbb740b0",
            "380bf971a8bb4f328c42490f295bea77",
            "8b7d6b5a265c46e89c6d4e3b9b3ad30e",
            "c71ef1f4766f49ccb77a9bf58c52e9f1",
            "9e8537809c1b47f69f05323ca885c14f",
            "23a45d6f943843f39fc4daf8a6ead315",
            "a6b45ac30875482e92cdfb9803500ebf",
            "893d277d03184c33b45b6766f018cfd5",
            "65f7d26377dc4eb198f21c99176e8e24",
            "3a6b67c2eec34edc86c7e52394c7b5d8",
            "92ea41b5926f48cb80cab9254183e160",
            "8c4f4bbcab5e4a4e8a3ac97b57f37ada",
            "d542a4e3b444453e958316f6dbcb0e62",
            "c6107733a23542148855e644044d3a8d",
            "4b8a73f127d7408ba6b30f32cf1386ee",
            "c5d07b012fa744628dccc132b917b723",
            "24e6824b944244649a243070e76aea97",
            "587fd3002c6a42ff850502c9adebf26f",
            "b78b8d473b434d3fa7f1b58e72052727",
            "1d770d550f7d4a58a4c2efddedf53d16",
            "cfb14e18fd024474868b249a6b897138",
            "dd3306a45c684d4da3b1cae05355378f",
            "8db74dd036404d8dbd865bd3c328efa8",
            "c4b3f309763548fea8c40a9a1bd7c842",
            "a91991157f9041dab6dad2d015f3b39c",
            "61e5da1beb394dc59a7505fb526874f9",
            "68e8f67b092a45bebe4d0f5c7547ba14",
            "a3462b8d380342049a582fe188375856",
            "3fce2f8fe3ba4990baf4e8e96facc869",
            "15929cb6c02f472ca7a81e758c40596b",
            "81ed22476aa14b5e90a40c2219de04da",
            "5619697884ce46359ac5f426f1346a6f",
            "b5f69ede5dea423fb03c62061eef2a82",
            "cf79881c5b3248159c38cc960873272a",
            "30fd6ad6a0f944048739f6ccccba6929",
            "e259aa5f1774489f8cc9590ed844a390",
            "63e63fbd52c2497a9979ac3739a4de71",
            "caccee9f211444b38187677c49f888c7",
            "f84a2bb38493403191b9286d6e04fe82",
            "2814540e0a0c4f57966a21534422d0ee",
            "24a51f6c7c93411d8c4456dca14ad114",
            "f500c3f8de3041038ce4fff19715c5ff",
            "0732276bc364415194926c1dd30368ff",
            "60a50cbc52a5490985130ece4d954b45",
            "0c5d87f0b96d40ee8f983e1ebdba4789",
            "9bdbf562a39a48b5ab321a15d64f854f",
            "f638a1440f30422b954ea1de1b2aea73",
            "1b4fcf7eb6f84742b29c3e54dff6fb3b",
            "197734cb994941139226c6a01c84bbcc",
            "e653a960ba7d4110a093989b3d70688f",
            "99e1c95bf13c4350b381e8dfe2e0eec4",
            "fda89917b27042958feca2ab192705dc",
            "f68eeacd48824f7bb547429d0d73a528",
            "4fa741c6b9c4401e98c795083a5da881",
            "4544f5d7f61f4a8fac6d86e70ab77c6b",
            "4554c38a24b54014b6e0970bc06434ee",
            "126c8d47f7074f9a8f0618ae4b61935f",
            "4a0e9ffb4c6d40529ddd420c22a9e218",
            "249a5c128cc54ddb84bed16105b402fb",
            "3b952cd89e144c7f83ae84310f3d93d0",
            "5407fa585a024d9298b73486c1a2b082",
            "896588b21bf2474dba25aef44a41dd28",
            "e119673d3b5a4722aa7f244b5b051ba5",
            "4dd9dd7ecff64f9a80804990d1af104f",
            "70cb55134ede4519bc5a6baabc46d56d",
            "e16e013f92134ea79bb3884cad057892",
            "03bc7be5388940e6b190b08dd9025ccd",
            "cc35014cc2364d5ca2ec6533a2144882",
            "2f8573e073f8483f86ca6fdfd2a2f799",
            "57596f1ac1ca469792bfbca8f81c48e8",
            "2566b0270b8f49618d33c4835274b84c",
            "f90bd03c125d43b0b4d9e033f93899d8",
            "3b6a458aedd34ef382f78d2636e3357e",
            "702344bded7046df9288f5e4da248acf",
            "25a57c3b03594bb986890a6d2ae9382a",
            "c196fd1014a64f3db04cb81032a85eec",
            "4404af5c0d81441e82a625afe3470a1b",
            "873ee357d0564435b5c29a194abc6ed7",
            "065fc65bdba54767bb2e69cb5f1175b8",
            "be23d757a0124696bdc34cae500d3390",
            "4f6ee8c6bd974323818483cc3011df71",
            "d8b580c4abc44e47ad4db9feb43b6865",
            "0efd76f5072641cfb80b23f5d112bea8",
            "5f15d4bed1cb4bd1b62b0b33dfdc8757",
            "2bc73a4b23cc422db9f94e558bedcd5e",
            "f8740d6b3e584efd9707d7fc5d073666",
            "a8e484a008eb470d8aa480467377ed99",
            "f66daf6b623e48f2b918a55890f8dd79",
            "4878cefc450140cb86786652494a6e9f",
            "d9481584cdeb44109f0a9cbca647fa53",
            "46757c095dc1445596b183a61c88bc94",
            "80eb2da22676401a82236ed33fd63bdf",
            "9081323cd36048dba4fe59da77ba41f9",
            "db82a6486ef34aa59c5bacff901c050c",
            "6a95a389a2684e4a9d0d0858d9d3b7a0",
            "9d794b00a5764528a40b433b90993719",
            "42250a042d3546589aacdf49ac46e752",
            "1f2e67390ef5410b9ace57ddf4b60dc7",
            "ca3bc5d92bc84767aa09d053628b47e9"
          ]
        },
        "id": "6fdGP3a7HhWe",
        "outputId": "969970e6-3c6e-475d-fda2-a76f33af7ea0"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "2482617b01824ba883840d0a5733244d",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading (\u2026)e933c/.gitattributes:   0%|          | 0.00/737 [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "d971f9fdff02467980219cef9e1ff242",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading (\u2026)_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "e228fc0b479742ac9ae648e51644dbf5",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading (\u2026)cbe6ee933c/README.md:   0%|          | 0.00/9.85k [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "821be6088ae0424d9607838a22c2b0e8",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading (\u2026)e6ee933c/config.json:   0%|          | 0.00/591 [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "e161dafcbf1a4396bdf09cad59fa465f",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading (\u2026)ce_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "352f2df4ade34ea29595b042fbb740b0",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading (\u2026)33c/data_config.json:   0%|          | 0.00/15.7k [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "8c4f4bbcab5e4a4e8a3ac97b57f37ada",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "8db74dd036404d8dbd865bd3c328efa8",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading (\u2026)nce_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "cf79881c5b3248159c38cc960873272a",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading (\u2026)cial_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "0c5d87f0b96d40ee8f983e1ebdba4789",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading (\u2026)e933c/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "4554c38a24b54014b6e0970bc06434ee",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading (\u2026)okenizer_config.json:   0%|          | 0.00/383 [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "03bc7be5388940e6b190b08dd9025ccd",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading (\u2026)933c/train_script.py:   0%|          | 0.00/13.2k [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "873ee357d0564435b5c29a194abc6ed7",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading (\u2026)cbe6ee933c/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "4878cefc450140cb86786652494a6e9f",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "Downloading (\u2026)6ee933c/modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/plain": [
              "SentenceTransformer(\n",
              "  (0): Transformer({'max_seq_length': 128, 'do_lower_case': False}) with Transformer model: MPNetModel \n",
              "  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})\n",
              "  (2): Normalize()\n",
              ")"
            ]
          },
          "execution_count": 8,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "from sentence_transformers import SentenceTransformer\n",
        "\n",
        "# load the model from huggingface\n",
        "retriever = SentenceTransformer(\n",
        "    'flax-sentence-embeddings/all_datasets_v3_mpnet-base',\n",
        "    device=device\n",
        ")\n",
        "retriever"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "kAQouih7Kc9X"
      },
      "source": [
        "# Initialize Pinecone Index"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "pI9ENCKlTq6m"
      },
      "source": [
        "Now we need to initialize our Pinecone index. The Pinecone index stores vector representations of our passages which we can retrieve using another vector (the query vector). We first need to initialize our connection to Pinecone. For this, we need a free [API key](https://app.pinecone.io/); you can find your environment in the [Pinecone console](https://app.pinecone.io) under **API Keys**. We initialize the connection like so:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import os\n",
        "from pinecone import Pinecone\n",
        "\n",
        "# initialize connection to pinecone (get API key at app.pinecone.io)\n",
        "api_key = os.environ.get('PINECONE_API_KEY') or 'PINECONE_API_KEY'\n",
        "\n",
        "# configure client\n",
        "pc = Pinecone(api_key=api_key)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "Now we setup our index specification, this allows us to define the cloud provider and region where we want to deploy our index. You can find a list of all [available providers and regions here](https://docs.pinecone.io/docs/projects)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "from pinecone import ServerlessSpec\n",
        "\n",
        "cloud = os.environ.get('PINECONE_CLOUD') or 'aws'\n",
        "region = os.environ.get('PINECONE_REGION') or 'us-east-1'\n",
        "\n",
        "spec = ServerlessSpec(cloud=cloud, region=region)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "bWXiQV0qUqv3"
      },
      "source": [
        "Now we can create our vector index. We will name it `ner-search` (feel free to chose any name you prefer). We specify the metric type as `cosine` and dimension as `768` as these are the vector space and dimensionality of the vectors output by the retriever model."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "index_name = \"ner-search\""
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {},
      "outputs": [],
      "source": [
        "import time\n",
        "\n",
        "# check if index already exists (it shouldn't if this is first time)\n",
        "if index_name not in pc.list_indexes().names():\n",
        "    # if does not exist, create index\n",
        "    pc.create_index(\n",
        "        index_name,\n",
        "        dimension=768,\n",
        "        metric='cosine',\n",
        "        spec=spec\n",
        "    )\n",
        "    # wait for index to be initialized\n",
        "    while not pc.describe_index(index_name).status['ready']:\n",
        "        time.sleep(1)\n",
        "\n",
        "# connect to index\n",
        "index = pc.Index(index_name)\n",
        "# view index stats\n",
        "index.describe_index_stats()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "X0zu4ebVYvVi"
      },
      "source": [
        "# Generate Embeddings and Upsert"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "vd6E9SEt4pM5"
      },
      "source": [
        "We generate embeddings for the `title_text` column we created earlier. Alongside the embeddings, we also include the named entities in the index as metadata. Later we will apply a filter based on these named entities when executing queries.\n",
        "\n",
        "Let's first write a helper function to extract named entities from a batch of text."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 13,
      "metadata": {
        "id": "0Vk1FW45LkCa"
      },
      "outputs": [],
      "source": [
        "def extract_named_entities(text_batch):\n",
        "    # extract named entities using the NER pipeline\n",
        "    extracted_batch = nlp(text_batch)\n",
        "    entities = []\n",
        "    # loop through the results and only select the entity names\n",
        "    for text in extracted_batch:\n",
        "        ne = [entity[\"word\"] for entity in text]\n",
        "        entities.append(ne)\n",
        "    return entities"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "zjDSZh9Ynuc7"
      },
      "source": [
        "Now we create the embeddings. We do this in batches of `64` to avoid overwhelming machine resources or API request limits."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 14,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 118,
          "referenced_widgets": [
            "690af3bb20854c3fb3291cfc0de55bd0",
            "9db526a12ae84cbd923a02f014a01299",
            "1c17f94cfac14c70aaa35b0aeb100612",
            "c76a2da25d7b4ad098fb6978646d0ad3",
            "89d813b345f94053b508deb6ee8a6e11",
            "beae74fa9ce94856a90597743d148615",
            "60f03f5e0d5d437ab572c96e5ed4c8e4",
            "1032cf263b854095871281f80e31e6d9",
            "85c89b8f1b014ef8819b11875b499f3e",
            "f198aedadad64d9c96f469c725080ea1",
            "8de8cec176a846af8e293f568caec461"
          ]
        },
        "id": "MZ6JP50wSm9o",
        "outputId": "e4a74da5-2746-4e46-c2ec-956c1396ab90"
      },
      "outputs": [
        {
          "data": {
            "application/vnd.jupyter.widget-view+json": {
              "model_id": "690af3bb20854c3fb3291cfc0de55bd0",
              "version_major": 2,
              "version_minor": 0
            },
            "text/plain": [
              "  0%|          | 0/313 [00:00<?, ?it/s]"
            ]
          },
          "metadata": {},
          "output_type": "display_data"
        },
        {
          "data": {
            "text/plain": [
              "{'dimension': 768,\n",
              " 'index_fullness': 0.19776,\n",
              " 'namespaces': {'': {'vector_count': 19776}},\n",
              " 'total_vector_count': 19776}"
            ]
          },
          "execution_count": 14,
          "metadata": {},
          "output_type": "execute_result"
        }
      ],
      "source": [
        "from tqdm.auto import tqdm\n",
        "import warnings\n",
        "warnings.filterwarnings('ignore', category=UserWarning)\n",
        "\n",
        "# we will use batches of 64\n",
        "batch_size = 64\n",
        "\n",
        "for i in tqdm(range(0, len(df), batch_size)):\n",
        "    # find end of batch\n",
        "    i_end = min(i+batch_size, len(df))\n",
        "    # extract batch\n",
        "    batch = df.iloc[i:i_end].copy()\n",
        "    # generate embeddings for batch\n",
        "    emb = retriever.encode(batch[\"title_text\"].tolist()).tolist()\n",
        "    # extract named entities from the batch\n",
        "    entities = extract_named_entities(batch[\"title_text\"].tolist())\n",
        "    # remove duplicate entities from each record\n",
        "    batch[\"named_entities\"] = [list(set(entity)) for entity in entities]\n",
        "    batch = batch.drop('title_text', axis=1)\n",
        "    # get metadata\n",
        "    meta = batch.to_dict(orient=\"records\")\n",
        "    # create unique IDs\n",
        "    ids = [f\"{idx}\" for idx in range(i, i_end)]\n",
        "    # add all to upsert list\n",
        "    to_upsert = list(zip(ids, emb, meta))\n",
        "    # upsert/insert these records to pinecone\n",
        "    _ = index.upsert(vectors=to_upsert)\n",
        "\n",
        "# check that we have all vectors in index\n",
        "index.describe_index_stats()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "mMmWKq6VP9v2"
      },
      "source": [
        "Now we have indexed the articles and relevant metadata. We can move on to querying."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Zu-vflsWQfCQ"
      },
      "source": [
        "# Querying"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "nYWleiiiX1JK"
      },
      "source": [
        "First, we will write a helper function to handle the queries."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 15,
      "metadata": {
        "id": "5ROdRPTUX0LI"
      },
      "outputs": [],
      "source": [
        "from pprint import pprint\n",
        "\n",
        "def search_pinecone(query):\n",
        "    # extract named entities from the query\n",
        "    ne = extract_named_entities([query])[0]\n",
        "    # create embeddings for the query\n",
        "    xq = retriever.encode(query).tolist()\n",
        "    # query the pinecone index while applying named entity filter\n",
        "    xc = index.query(xq, top_k=10, include_metadata=True, filter={\"named_entities\": {\"$in\": ne}})\n",
        "    # extract article titles from the search result\n",
        "    r = [x[\"metadata\"][\"title\"] for x in xc[\"matches\"]]\n",
        "    return pprint({\"Extracted Named Entities\": ne, \"Result\": r})"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "iu1ddk-Nnuc8"
      },
      "source": [
        "Now try a query."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 16,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "nAGsfLQpXqCl",
        "outputId": "c8d846c4-8a0a-41c4-ed2f-73abe0d17c92"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'Extracted Named Entities': ['Greece'],\n",
            " 'Result': ['Budget-Friendly Holidays: Visit The Best Summer Destinations In '\n",
            "            'Greece | easyGuide',\n",
            "            'Exploring Greece',\n",
            "            'The Search for Best Villas in Greece for Rental Ends Here | '\n",
            "            'Alasvillas | Greece',\n",
            "            'Perip\u00e9teies in Greece \u2014 Week 31. Adventures in Greece as we '\n",
            "            'pursue the\u2026',\n",
            "            'Greece has its own Dominic Cummings \u2014 and things are about to get '\n",
            "            'scary',\n",
            "            'Our stay at Ormos Marathokampos in Samos',\n",
            "            'Reintroducing Greece',\n",
            "            'Letting go in Greece',\n",
            "            'AYS Daily Digest 13/03/20: People removed from Greek islands '\n",
            "            'without a chance to seek asylum',\n",
            "            'True Crime Addiction Newsletter']}\n"
          ]
        }
      ],
      "source": [
        "query = \"What are the best places to visit in Greece?\"\n",
        "search_pinecone(query)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 17,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "F7BPaymZZsx7",
        "outputId": "9133a16a-4913-4dd8-800c-33585b7d662d"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'Extracted Named Entities': ['London'],\n",
            " 'Result': ['Historical places to visit in London',\n",
            "            'You\u2019ll never look at London the same way again after playing '\n",
            "            'Pokemon GO',\n",
            "            'Primrose and Regent\u2019s Park London Walk \u2014 Portraits in the City',\n",
            "            'The Building of London',\n",
            "            '9 Workspaces in London Perfect for Startups : HotPatch',\n",
            "            'Cinema-going in Covid London',\n",
            "            'Don\u2019t miss the scenic route',\n",
            "            'The Beatnik Brit',\n",
            "            'World Destinations: The Most visited and Busiest Places in the '\n",
            "            'World at all Times',\n",
            "            'London Bar BrewDog Dumps Cash Payments for Bitcoin']}\n"
          ]
        }
      ],
      "source": [
        "query = \"What are the best places to visit in London?\"\n",
        "search_pinecone(query)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 18,
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "6fFymNVIRFJY",
        "outputId": "689053ab-3597-4954-9beb-eedb0159586d"
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "{'Extracted Named Entities': ['SpaceX', 'Mars'],\n",
            " 'Result': ['Mars Habitat: NASA 3D Printed Habitat Challenge',\n",
            "            'Reusable rockets and the robots at sea: The SpaceX story',\n",
            "            'Colonising Planets Beyond Mars',\n",
            "            'Musk Explained: The Musk approach to marketing',\n",
            "            'How We\u2019ll Access the Water on Mars',\n",
            "            'Chasing Immortality',\n",
            "            'Mission Possible: How Space Exploration Can Deliver Sustainable '\n",
            "            'Development',\n",
            "            'I Know I Shouldn\u2019t get Worked up Over a Meme',\n",
            "            'What If Mars Never Lost Its Water?',\n",
            "            'SpaceX inspiration4 all-civilian spaceflight: When to watch and '\n",
            "            'things to know']}\n"
          ]
        }
      ],
      "source": [
        "query = \"Why does SpaceX want to build a city on Mars?\"\n",
        "search_pinecone(query)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "HDgsTI95nuc9"
      },
      "source": [
        "These all look like great results, making the most of Pinecone's advanced vector search capabilities while limiting search scope to relevant records only with a named entity filter."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 19,
      "metadata": {
        "id": "lswqV7uA8Un2"
      },
      "outputs": [],
      "source": [
        "pc.delete_index(index_name)"
      ]
    }
  ],
  "metadata": {
    "accelerator": "GPU",
    "colab": {
      "gpuType": "T4",
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.8.5"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}